WEEK 2: TIDY DATA + BASICS OF GRAPHICS

Tuesday, January 17th

Today we will…

Tidy Data

Artwork by Allison Horst

Same Data, Different Formats

Which data follows a tidy data format?

Team  Points  Assists  Rebounds
A     88      12       22
B     91      17       28
C     99      24       30
D     94      28       31

Team  Variable  Value
A     Points    88
A     Assists   12
A     Rebounds  22
B     Points    91
B     Assists   17
B     Rebounds  28
C     Points    99
C     Assists   24
C     Rebounds  30
D     Points    94
D     Assists   28
D     Rebounds  31
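
The two layouts above hold the same values, so one can be converted into the other. Here is a minimal sketch, assuming the tidyr package is installed; the object names wide and long are made up for illustration.

```r
library(tidyr)

# Wide layout: one row per team, one column per statistic
wide <- data.frame(
  Team     = c("A", "B", "C", "D"),
  Points   = c(88, 91, 99, 94),
  Assists  = c(12, 17, 24, 28),
  Rebounds = c(22, 28, 30, 31)
)

# Long layout: one row per team-statistic pair
long <- pivot_longer(
  wide,
  cols      = c(Points, Assists, Rebounds),
  names_to  = "Variable",
  values_to = "Value"
)

long
```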

Working with External Data

Common Types of Data Files

.csv : “Comma-separated”

Name, Age
Bob, 49
Joe, 40


.xls, .xlsx: Microsoft Excel Spreadsheet

  • Common approach: save as .csv
  • Nicer approach: readxl package

.txt: Plain text

  • Could be just text
  • Could be comma-separated data
  • Could be tab-separated, bar-separated, etc.
  • Need to let R know what to look for

Loading External Data

The tidyverse has cleaned-up versions of base R's data-loading functions in the readr and readxl packages:

  • read_csv() works like base R's read.csv(), but returns a tibble and does not convert text columns to factors

  • read_tsv() is for tab-separated data

  • read_table() is for any data with “columns” (white space separating)

  • read_delim() is for special “delimiters” separating data

  • read_excel() is specifically for dealing with Excel files
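
As a rough illustration of the functions listed above, the sketch below writes two tiny files to a temporary location and reads them back. It assumes readr version 2.0 or later is installed; the file contents and object names are made up for the example.

```r
library(readr)

# Write two small example files to a temporary location (made-up data)
csv_file <- tempfile(fileext = ".csv")
writeLines(c("Name,Age", "Bob,49", "Joe,40"), csv_file)

bar_file <- tempfile(fileext = ".txt")
writeLines(c("Name|Age", "Bob|49", "Joe|40"), bar_file)

# Comma-separated file
people_csv <- read_csv(csv_file, show_col_types = FALSE)

# Bar-separated file: tell R which delimiter to look for
people_bar <- read_delim(bar_file, delim = "|", show_col_types = FALSE)

people_csv
```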

Grammar of Graphics

Grammar of Graphics: graphic forms from the ground up

Think of a data visualization or graph as a mapping

  • from variables in the data set, (or statistics computed from the data)
  • to visual attributes (or “aesthetics”) of marks (or “geometric elements”) on the page/screen

Grammar of Graphics: why bother?

It’s not just a neat party trick!

  • More flexible than “chart zoo” of named graphs
  • Software understands the structure of your graph
    • easily automate small multiples for data subsets

Note

“[The grammar] makes it easier for you to iteratively update a plot, changing a single feature at a time. The grammar is also useful because it suggests the high-level aspects of a plot that can be changed, giving you a framework to think about graphics, and hopefully shortening the distance from mind to paper. It also encourages the use of graphics customised to a particular problem, rather than relying on generic named graphics.”

Grammar of Graphics: components

GoG components, as specified in R’s ggplot2

  • data
  • aes : aesthetic mappings (position, length, color, symbol, …)
  • geom : geometric element (point, line, bar, …)
  • stat : statistical variable transformation (identity, count, linear model, quantile, …)
  • scale : scale transformation (log scale, color mapping, axes tick breaks, …)
  • coord : Cartesian, polar, map projection, …
  • facet : divide into subplots / small multiples using a categorical variable

Of course, we can also control axes, legends, titles … (guides)
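
To make the component list above concrete, here is one possible plot that spells out each piece explicitly. It is only a sketch using the mpg data that ships with ggplot2; most of these components have defaults and are usually left out.

```r
library(ggplot2)

p <- ggplot(data = mpg,                            # data
            mapping = aes(x = displ, y = hwy)) +   # aes
  geom_point(stat = "identity") +                  # geom + stat
  scale_x_log10() +                                # scale
  coord_cartesian() +                              # coord
  facet_wrap(~ drv)                                # facet

p
```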

Using ggplot2

How to build a graph

Calling ggplot() begins a plot that you finish by adding layers.

ggplot(data = mpg)

ggplot(data = mpg, 
       aes(x = class, y = hwy)
       )

ggplot(data = mpg, 
       aes(x = class, y = hwy)
       ) +
  geom_jitter()

ggplot(data = mpg, 
       aes(x = class, y = hwy)
       ) +
  geom_jitter() +
  geom_boxplot()

How would you change the code to have the points on top of the boxplots?
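
One possible answer, as a sketch: ggplot2 draws layers in the order they are added, so adding geom_boxplot() first and geom_jitter() second places the points on top. The name p is only there so the plot can be printed.

```r
library(ggplot2)

p <- ggplot(data = mpg,
            aes(x = class, y = hwy)
            ) +
  geom_boxplot() +   # drawn first, so it sits underneath
  geom_jitter()      # drawn second, so the points are on top

p
```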

Aesthetics

In ggplot2, we map variables from the data set to aesthetics on the chart

ggplot(data = txhousing, aes(x = date, y = median, color = city)) + 
  geom_point() + 
  geom_smooth(method = "loess") + 
  xlab("Date") + ylab("Median Home Price") + 
  ggtitle("Texas Housing Prices")

Not an exhaustive list – see ggplot2 cheat sheet

  • x, y
  • color and fill
  • linetype
  • lineend
  • linejoin
  • size
  • shape

Global Aesthetics

ggplot(data = housingsub, 
       mapping = aes(x = date, 
                     y = median)
       ) +
  geom_point()

Local Aesthetics

ggplot(data = housingsub) +
  geom_point(mapping = aes(x = date, 
                           y = median)
             )

Mapping Aesthetics

ggplot(data = housingsub) +
  geom_point(mapping = aes(x = date, 
                           y = median,
                           color = city)
             )

Setting Aesthetics

ggplot(data = housingsub) +
  geom_point(mapping = aes(x = date, 
                           y = median, 
                           color = city), 
             color = "blue"
               )
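
The difference between mapping and setting can be sketched side by side. The slides do not show how housingsub was built, so this example assumes a hypothetical stand-in, housing_example, made of three cities from the txhousing data in ggplot2.

```r
library(ggplot2)

# Hypothetical stand-in for housingsub: three cities from txhousing
housing_example <- subset(txhousing,
                          city %in% c("Austin", "Dallas", "Houston"))

# Mapping: color varies with the data, so it goes inside aes()
p_mapped <- ggplot(data = housing_example) +
  geom_point(mapping = aes(x = date, y = median, color = city))

# Setting: color is one fixed value, so it goes outside aes()
p_set <- ggplot(data = housing_example) +
  geom_point(mapping = aes(x = date, y = median),
             color = "blue")

p_mapped
p_set
```

The mapped version gets a legend for city; the set version does not, because the color no longer carries any information from the data.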

Geometric objects

In ggplot2, we use a geom function to represent data points, and use the geom’s aesthetic properties to represent variables.

ggplot(data = mpg,
       aes(x = cty,
           y = hwy,
           color = class)
       ) +
  geom_point() +
  labs(x = "City (mpg)", y = "Highway (mpg)")

ggplot(data = mpg,
       aes(x = cty,
           y = hwy,
           color = class)
       ) +
  geom_text(aes(label = class)) +
  labs(x = "City (mpg)", y = "Highway (mpg)")

Not an exhaustive list – see ggplot2 cheat sheet

one variable

  • geom_density()
  • geom_dotplot()
  • geom_histogram()
  • geom_boxplot()

two variables

  • geom_point()
  • geom_line()
  • geom_density_2d()

three variables

  • geom_contour()
  • geom_raster()

Once our data is formatted and we know what type of variables we are working with, we can select the correct geom for our visualization.

Alternative method of building layers: Stats

A stat builds a new variable to plot (e.g., count and proportion)
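
As a sketch of what this means in practice: geom_bar() uses the count stat by default, so the same bar chart can be built starting from either the geom or the stat. The example assumes ggplot2 3.3.0 or later (for after_stat()); the object names are made up.

```r
library(ggplot2)

# geom_bar() uses stat = "count" by default: it counts the rows in each
# class and plots that new variable on the y axis
p_count <- ggplot(data = mpg, aes(x = class)) +
  geom_bar()

# The same layer, built from the stat instead of the geom
p_stat <- ggplot(data = mpg, aes(x = class)) +
  stat_count(geom = "bar")

# Plot the computed proportion instead of the count;
# group = 1 makes the proportions relative to the whole data set
p_prop <- ggplot(data = mpg, aes(x = class)) +
  geom_bar(aes(y = after_stat(prop), group = 1))

p_prop
```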

Faceting

A way to extract subsets of data and place them side-by-side in graphics

Note

Faceted plots are sometimes called small multiples.

ggplot(data = mpg, aes(x = cty, y = hwy, color = class)) + 
  geom_point() +
  facet_grid(~class)

  • facet_grid(. ~ b): facet into columns based on b
  • facet_grid(a ~ .): facet into rows based on a
  • facet_grid(a ~ b): facet into both rows and columns
  • facet_wrap( ~ fl): wrap facets into a rectangular layout

You can set scales to let axis limits vary across facets:

  • facet_grid(y ~ x, scales = "free"): x and y axis limits adjust to individual facets
    • scales = "free_x": only the x axis limits adjust
    • scales = "free_y": only the y axis limits adjust
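
A small sketch of free scales, using the mpg data; the choice of drv and cyl as faceting variables is just for illustration.

```r
library(ggplot2)

# With facet_grid(), each row of panels gets its own y axis range
# and each column of panels gets its own x axis range
p_free <- ggplot(data = mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl, scales = "free")

p_free
```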

You can also set a labeller to adjust facet labels:

  • facet_grid(. ~ fl, labeller = label_both)
  • facet_grid(. ~ fl, labeller = label_bquote(cols = alpha ^ .(fl)))
  • facet_grid(. ~ fl, labeller = label_parsed)

Position Adjustments

Position adjustments determine how to arrange geoms that would otherwise occupy the same space

  • position = "dodge": Arrange elements side by side
  • position = "fill": Stack elements on top of one another, normalize height
  • position = "stack": Stack elements on top of one another
  • position = "jitter": Add random noise to X & Y position of each element to avoid overplotting (see geom_jitter())

# Swap "dodge" for "fill" or "stack" to compare the adjustments
ggplot(mpg, aes(fl, fill = drv)) + 
  geom_bar(position = "dodge")

Plot Customizations

Clearer labels with labs()

ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy, color = cyl)) + 
  labs(x = "Engine Displacement (liters)", 
       y = "Highway MPG", 
       color = "Number of \nCylinders",
       title = "Car Efficiency")

ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy, color = cyl)) + 
  labs(x = "Engine Displacement (liters)", 
       y = "Highway MPG", 
       color = "Number of \nCylinders",
       title = "Car Efficiency") +
  theme_bw() +
  theme(legend.position = "bottom")

ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy, color = cyl)) + 
  labs(x = "Engine Displacement (liters)",
       color = "Number of \nCylinders",
       title = "Car Efficiency") +
  scale_y_continuous("Highway MPG", 
                     limits = c(0,50),
                     breaks = seq(0,50,5)
                     )

ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy, color = cyl)) + 
  labs(x = "Engine Displacement (liters)",
       y = "Highway MPG",
       color = "Number of \nCylinders",
       title = "Car Efficiency") +
  scale_color_gradient(low = "white", high = "green4")

Formatting your Plot Code

Tip

Notice how there is a lot of nesting that happens within ggplot2 code (e.g., parentheses within parentheses). It is good practice to put each geom and aesthetic on a new line. This makes code easier to read!

The general guideline is that each line of your code should not be over 80 characters long.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color = class)) + geom_point() + theme_bw() + labs(x = "City (mpg)", y = "Highway (mpg)")

ggplot(data = mpg, 
       mapping = aes(x = cty, 
                     y = hwy, 
                     color = class)
       ) + 
  geom_point() + 
  theme_bw() + 
  labs(x = "City (mpg)", 
       y = "Highway (mpg)"
       )

ggplot(data = mpg, 
       mapping = aes(x = cty, y = hwy, color = class)
       ) + 
  geom_point() + 
  theme_bw() + 
  labs(x = "City (mpg)", y = "Highway (mpg)")

PA 2: Using Data Visualization to Find the Penguins

Artwork by Allison Horst

Tip

I encourage you to use your neighbors for support!

To do…

  • PA 2: Using Data Visualization to Find the Penguins
    • Due TOMORROW, Wednesday (1/18) at 8:00AM
  • Bonus Challenge: Ugly Graphics of Penguins (+2)
    • Due TOMORROW, Wednesday (1/18) at 10:10AM

Note

I have office hours TODAY, Tuesday (1/17) from 2:40pm - 3pm in 25-103

Wednesday, January 18th

Today we will…

  • Review PA 2: Using Data Visualization to Find the Penguins
  • Ugly Graphics of Penguins
  • Mini lecture on text material
    • What makes a good graphic?
  • Lab 2: Exploring Rodents with ggplot2
  • Challenge 2: Spicing things up with ggplot2

What makes a good graphic?

Gestalt Principles

To do…

  • Lab 2: Exploring Rodents with ggplot2

  • Challenge 2: Spicing things up with ggplot2

  • Read Chapter 3: Data Cleaning and Manipulation

    • Concept Check 3.1 + 3.2 due Monday (1/23) at 8AM